You can view, download or pull the raw code for this content here https://github.com/jwiley/MonashHonoursStatistics
These are the R packages we will use.
options(digits = 2)
## new packages are lme4, lmerTest, and multilevelTools
library(tufte)
library(haven)
library(data.table)
library(JWileymisc)
library(lme4)
Loading required package: Matrix
library(lmerTest)
Attaching package: 'lmerTest'
The following object is masked from 'package:lme4':
lmer
The following object is masked from 'package:stats':
step
library(multilevelTools)
library(visreg)
library(ggplot2)
library(ggpubr)

## Set your working directory to the folder that has the data files.
## "Merged" is a merged long dataset of both baseline and daily diary
dm <- as.data.table(read_sav("[2021] PSY4210 merged.sav"))
1 Linear Mixed Models (LMMs) - Introduction
Often, observations are not independent. For example:
Repeated measures data, such as in longitudinal studies; repeated measures experiments, where observations are clustered within people
When individuals are clustered or grouped, such as people within families, schools, companies, resulting in clustering within the higher order unit
Clustered data versus repeated measures or longitudinal data may seem quite different, but from a statistical perspective, they pose many of the same challenges that are solved in basically the same ways. We will focus on repeated measures for now, but note the statistical methods apply to both contexts.
Here is some hypothetical data on a few people where systolic blood pressure (SBP) was measured at three time points:
ID  SBP1  SBP2  SBP3
 1   135   130   125
 2   120   125   121
 3   121   125     .
This data is stored in “wide” format. That is, each repeated measure is stored as a separate variable. For LMMs, we typically want data in a long format, like this:
ID  Time  SBP
 1     1  135
 1     2  130
 1     3  125
 2     1  120
 2     2  125
 2     3  121
 3     1  121
 3     2  125
In long format, data are stored with each measure in one variable and an additional variable to indicate the time point or which specific assessment is being examined. In long format, multiple rows may belong to any single unit. Not all units (here IDs) have to have the same number of rows. For example, some people like ID 3 may have missed a time point. With clustered data, you could have a large or small school with a different number of students (where student is the repeated measure within school rather than time points within a person). The reshape() function in R can be used to reshape data from a wide format to a long format or from a long format to a wide format. See the “Working with Data” topic for examples and details.
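As a sketch of that reshaping (using the hypothetical SBP data above and melt() from data.table, which is already loaded; reshape() works similarly):

```r
## hypothetical wide-format blood pressure data from the table above
wide <- data.table(
  ID = 1:3,
  SBP1 = c(135, 120, 121),
  SBP2 = c(130, 125, 125),
  SBP3 = c(125, 121, NA))

## melt() converts wide to long: one SBP column plus a Time indicator
long <- melt(wide, id.vars = "ID",
             variable.name = "Time", value.name = "SBP")
long[, Time := as.integer(gsub("SBP", "", Time))]

## drop the missing time point for ID 3, leaving unequal rows per ID
long <- long[!is.na(SBP)]
```

Note that after dropping the missing value, ID 3 has only two rows, which is fine for LMMs.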
Regardless of wide or long format, these data are clearly not independent. The different blood pressure readings within a person are likely more related to each other than to blood pressure readings from a different person.
1.1 Considering Non Independence
In previous PSY4210 cohorts there was a small data collection exercise where students completed some questions every day (like the BL dataset but repeated measures). We load that data using the read_sav() function and plot the daily energy ratings from a few different people in the following figure.
Not only do observations vary across days within a person, but different people have different mean energy levels.
Our first thoughts on analysing data where observations are not independent, may be to consider methods we already know, such as linear regression (or GLMs) or ANOVAs.
One method you might consider is ANOVA; specifically, we could use a repeated measures (RM) ANOVA. RM ANOVA can handle repeated measures if:
measurements occur at discrete time points (e.g., T1, T2, T3)
everyone has the same number of time points
the outcome is continuous, normally distributed data
However, RM ANOVA has many limitations. It cannot handle…
continuous time (e.g., if one person completes on days 0, 13, 22, and another on 0, 1, 20)
if someone is missing data on any time point, they are completely excluded from the analysis (unless you do something like imputation)
continuous predictors/explanatory variables (e.g., age in years) nor other types of outcomes
If ANOVAs are not ideal, what would happen if we use a linear regression? The following figure shows the linear regression lines for the association between energy and self esteem for each individual participant (just four for example) and the single line that is the result of an overall linear regression (in blue).
The overall linear regression has to be the same for everyone because in linear regression we have one intercept and one slope. The fact that different people may have different overall levels of energy or self esteem, and the fact that the association between energy and self esteem may not be the same for all people, cannot be captured by the linear regression.
Statistically, the linear regression assumption of independence also would be violated making the standard errors, confidence intervals, and p-values from a linear regression on non-independent data biased and wrong.
Regular regression / GLMs cannot capture individual differences in the level of each line (different intercepts by ID), nor can they capture that different people may have a different relationship between the predictor and outcome (different slopes by ID); they also violate the independence assumption. All of these issues stem from the fact that in the regressions / GLMs we have learned so far, all the coefficients are fixed effects, meaning that they are fixed (assumed) to be identical for every participant.
When you have only one observation per participant, fixed effects are necessary as it is impossible to estimate coefficients for any individual participant. We can only analyze by aggregating across individuals. With repeated measures, however, it is possible and important to consider whether a coefficient differs between people.
If a regression coefficient differs for each participant we say that it varies or is a random variable. Because of this we call coefficients that differ by person random effects, which are different from fixed effects because they are not assumed to be identical for everyone, but instead vary randomly by person.
1.2 LMM Theory
Linear Mixed Effects Models (LMMs) extend the fixed-effects only linear regression models covered in previous units to include both fixed and random effects. That is, to be able to include regression coefficients that are identical for everyone (fixed effects) and regression coefficients that vary randomly for each participant (random effects).
Because LMMs include both fixed and random effects, they are called mixed models.
Let’s see how this works starting with the simplest linear regression model and then moving into the simplest linear mixed model.
The simplest linear regression model is an intercept only model.
\[ y_i = b_0 * 1 + \varepsilon_i \]
The intercept here, \(b_0\), is the expected outcome value when all other predictors in the model are zero. Since there are no other predictors, this will be the mean of the outcome. The fixed effect means the linear regression model assumes that all participants have the same mean. Shown graphically, here are a few participants’ data with a blue dot showing the mean for each person. The regression line with an intercept only will be the blue line. It assumes that the mean (intercept) for all people is identical, which is not true. We need to make \(b_0\) a random effect – to allow it to vary randomly across participants.
ggplot(dd[ID %in% c(10, 18, 40, 64, 97)], aes(ID, dEnergy)) +
  geom_point(alpha = .2) +
  stat_summary(fun = mean, colour = "blue", shape = 18) +
  stat_smooth(method = "lm", se = FALSE, formula = y ~ 1) +
  theme_pubr()
We can see the actual distribution of mean energy levels by ID easily enough as follows.
plot(testDistribution(
  dd[, .(MeanEnergy = mean(dEnergy, na.rm = TRUE)), by = ID]$MeanEnergy),
  varlab = "Mean (average) Energy by ID")
The key point of the previous graph is to show that indeed, different participants have different mean energy levels. The mean energy levels vary and in this case appear to fairly closely follow a normal distribution. The means are a random variable.
If we assume that a random variable follows a particular distribution (e.g., a normal distribution), we can describe it with two parameters: the mean and standard deviation.
Assuming a random variable comes from a distribution gives us another way to think about fixed and random effects.
Fixed Effects approximate the distribution by: \(M\) = estimated mean; \(SD\) = 0 (i.e., the standard deviation is fixed at 0)
Random Effects approximate the distribution by: \(M\) = estimated mean; \(SD\) = estimated standard deviation (the SD is free to vary).
In this way, you can think of both fixed and random effects as both using a distribution to approximate a random variable (the association between say energy and self esteem for different people) but fixed effects make the strong assumption that the distribution has \(SD = 0\) whereas random effects relax that assumption and allow the possibility that \(SD > 0\).
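As a tiny illustration of this distinction, suppose we somehow knew each person’s slope relating energy to self esteem (the numbers here are made up):

```r
## hypothetical per-person slopes (made-up numbers)
slopes <- c(0.2, 0.5, 0.3, 0.6, 0.4)

## a fixed effect summarises the distribution by its mean only (SD fixed at 0)
mean(slopes)

## a random effect additionally estimates the SD of the distribution
sd(slopes)
```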
1.3 Intraclass Correlation Coefficient (ICC)
In repeated measures / multilevel (two-level) data, we can decompose variability into two sources: variability between individuals and variability within individuals.
We call the ratio of between variance to total variance the Intraclass Correlation Coefficient (ICC):

\[ ICC = \frac{\sigma^{2}_{between}}{\sigma^{2}_{between} + \sigma^{2}_{within}} \]

The ICC varies between 0 and 1.
0 indicates that all individual means are identical so that all variability occurs within individuals.
1 indicates that within individuals all values are the same with all variability occurring between individuals.
A basic understanding of these equations is helpful as it lets us interpret the ICC. Suppose that every person’s mean was identical. There would be no variation at the between person level. That is, \(\sigma^{2}_{between} = 0\). That results in:
\[ ICC = \frac{0}{0 + \sigma^{2}_{within}} = 0 \]
When there are no differences between people, \(ICC = 0\).
What happens if there is no variation within people? What if all the variation is between people? That is, \(\sigma^{2}_{within} = 0\). That results in:

\[ ICC = \frac{\sigma^{2}_{between}}{\sigma^{2}_{between} + 0} = 1 \]
When there are no differences within people, \(ICC = 1\). The ICC is the ratio of the differences between people to the total variance. A small ICC near 0 tells us that almost all the variation exists within person, not between person. A high ICC near 1 tells us that almost all the variation occurs between people with very little variation within person. The ICC can be interpreted as the percent of total variation that is between person.
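As a quick numeric sketch (with hypothetical variance components), the ICC is just this ratio of between variance to total variance:

```r
## hypothetical variance components
sigma2_between <- 0.78
sigma2_within  <- 1.18

icc <- sigma2_between / (sigma2_between + sigma2_within)
round(icc, 2)  ## 0.4
```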
The following figure shows two examples. The High Between variance graph would have a high ICC and the Low Between variance graph would have a low ICC.
set.seed(1234)
ex.data.1 <- data.table(
  ID = factor(rep(1:4, each = 10)),
  time = rep(1:10, times = 4),
  y = rnorm(40, rep(1:4, each = 10), .2))
ex.data.2 <- data.table(
  ID = factor(rep(1:4, each = 10)),
  time = rep(1:10, times = 4),
  y = rnorm(40, 2.5, 1))
ggarrange(
  set_palette(
    ggplot(ex.data.1, aes(time, y, colour = ID, shape = ID)) +
      stat_smooth(method = "lm", formula = y ~ 1, se = FALSE) +
      geom_point() + theme_pubr(), "jco"),
  set_palette(
    ggplot(ex.data.2, aes(time, y, colour = ID, shape = ID)) +
      stat_smooth(method = "lm", formula = y ~ 1, se = FALSE) +
      geom_point() + theme_pubr(), "jco"),
  ncol = 1,
  labels = c("High Between", "Low Between"))
Example showing the difference between high and low between person variance in data.
Note
Beyond the scope of this unit, sometimes we have even more levels of data, like repeated observations measured within people and people measured within classrooms or cities, etc. We can similarly calculate ICCs for each level, i.e., city, person, etc. To do this using the iccMixed() function, you would just add more ID variables.
We can calculate the ICC in R using the iccMixed() function. It requires the name of the dependent variable, the ID variable, indicating which rows in the dataset belong to which units, and the dataset name. It returns the standard deviation (Sigma) attributable to the units (ID; the between variance) and to the residual (the within variance). The column labelled ICC is the proportion of total variance attributable to each effect. The main ICC would be the ICC for the ID, here 0.40 for stress.
iccMixed("dStress", id ="ID", data = dd)
Var Sigma ICC
<char> <num> <num>
1: ID 0.78 0.4
2: Residual 1.18 0.6
The ICC can differ across variables. For example, one variable might be quite stable across days and another very unstable. When working with repeated measures data, the ICC for each repeated measure variable is a useful descriptive statistic to report. For example, the following calculates the ICC for energy, 0.28, which is lower than that for stress, indicating that energy is less stable from day to day than is stress, or equivalently that compared to energy, stress has relatively more variation between people than within them.
iccMixed("dEnergy", id ="ID", data = dd)
Var Sigma ICC
<char> <num> <num>
1: ID 0.52 0.28
2: Residual 1.36 0.72
1.4 Between and Within Effects
The ICC illustrates an idea that comes up a lot in multilevel or mixed effects models: the idea of between and within effects. A total effect or the observed total score can be decomposed into some part attributable to between and some part attributable to a within effect as shown in the following diagram.
Decomposing multilevel effects
Between effects exist between units whereas within effects exist within a unit. In psychology, often the unit is a person and the between effects are differences between people whereas the within effects are differences within an individual.
Note
When you subtract the mean from a variable, the new mean will be equal to 0. For example, if you have three numbers:
\[ (1, 3, 5) \]
the mean is 3, if you subtract the mean you get:
\[ (-2, 0, +2) \]
and the mean of the deviations are 0.
If you have:
\[ (4, 5, 6) \]
the mean is 5 and if you subtract the mean you get:
\[ (-1, 0, +1) \]
as deviation scores and the mean of these deviations is 0.
It is a fact that the mean of deviations from a mean will always be 0. When we calculate a within person variable from a total variable, we separate out the mean as the between portion; the within portion then contains deviations from individuals’ own means, so the mean of the within portion will be 0.
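The decomposition in the note above can be checked directly in R:

```r
x <- c(1, 3, 5)

between <- mean(x)      ## 3, the between portion (constant within a person)
within  <- x - between  ## -2, 0, 2, the within portion (deviations)

mean(within)            ## 0: deviations from a mean always average to 0
between + within        ## recovers the original scores 1, 3, 5
```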
For example, suppose that in one individual, Person A, over three days, we observed the following scores on happiness: 1, 3, 5. Those total observed scores could be decomposed into one part attributable to a stable, between person effect (the mean) and another part that purely captures daily fluctuations, the within person effect (deviations from the individuals’ own mean). We could break down those numbers: 1, 3, 5 as shown in the following figure.
Decomposing multilevel effects - concrete example
Notice that the between person effect does not vary across days. It is constant for a single individual. The within person portion, however, does vary across days.
In this case, the between person effects would be the mean for each person, matching a different intercept for each person from our ICC example. The variance in these means (intercepts) is the between person variance. The within person effects are individual deviations from individuals’ own means and their variance would be the residual variance. Together those variances add up to the total variance and these portions are what the ICC captures.
1.5 Descriptive Statistics on Multilevel Data
With multilevel data, basic descriptive statistics can be calculated in different ways. We will use the merged dataset from the daily collection exercise. This dataset merges the baseline and daily datasets in a single, long format file.
To start with, suppose we had such a merged file (generally, a convenient way to store and analyze multilevel data) and we wanted to calculate descriptive statistics on some between person variables, such as loneliness and sex. We might start off as we have before using egltable() for some summary descriptive statistics.
egltable(c("loneliness", "sex"), data = dm)
M (SD)
<char> <char>
1: loneliness 2.14 (0.59)
2: sex 1.85 (0.36)
Note that as sex is coded 1/2 and not a factor, by default we are given a mean and standard deviation. This is not so helpful. We could convert sex to a factor, or set strict=FALSE, in which case egltable() does not strictly follow the class of each variable; instead, if a variable only has a few unique values, it assumes the variable is categorical/discrete, regardless of its official class in R.
egltable(c("loneliness", "sex"), data = dm, strict=FALSE)
M (SD)/N (%)
<char> <char>
1: loneliness 2.14 (0.59)
2: sex
3: 1 43 (15.2%)
4: 2 240 (84.8%)
To make things even clearer, let’s remind R what 1 and 2 represent.
dm[, sex := factor(
  sex,
  levels = c(1, 2),
  labels = c("male", "female"))]

## While we are at it, let's also tell R that relationship
## status is a factor too
dm[, relsta := factor(
  relsta,
  levels = c(1, 2, 3),
  labels = c("single",
             "in a committed exclusive relationship",
             "in a committed nonexclusive relationship"))]

## now let's redo the egltable
egltable(c("loneliness", "sex"), data = dm, strict = FALSE)
M (SD)/N (%)
<char> <char>
1: loneliness 2.14 (0.59)
2: sex
3: male 43 (15.2%)
4: female 240 (84.8%)
## now that R knows sex is a factor, we can remove the strict argument
egltable(c("loneliness", "sex"), data = dm)
M (SD)/N (%)
<char> <char>
1: loneliness 2.14 (0.59)
2: sex
3: male 43 (15.2%)
4: female 240 (84.8%)
Now we can see the mean and standard deviation for loneliness and the frequencies and percentages of female and male participants in the study. However, this reveals a problem/error. There were not 43 male and 240 female participants in the study. What this actually shows is the average loneliness, weighted by the number of days of data available for each participant, and the total number of daily surveys completed by male and female participants. That happens because the between person variables are repeated in long format.
Let’s look at the data just for the first two participants in the following table. We can see that on all days, loneliness and sex scores are identical for IDs 10 and 12.
This is how between person variables are typically stored when in a dataset combined with repeated measures data in long format. However, it means descriptive statistics on them are probably not what we really wanted. The means are essentially weighted means: participants who complete more days of data get a higher weighting because their loneliness values are repeated more times. For categorical/discrete variables, what we get is the number of assessments / observations for each level of the categorical variable, in this case the number of observations belonging to male and female participants. When using data tables, the solution is easy: drop duplicated rows so only one row of data per ID is kept, and then remake the table. The following code shows how to do this. The difference is in using data = dm[!duplicated(ID)] instead of data = dm.
egltable(c("loneliness", "sex"), data = dm[!duplicated(ID)])
M (SD)/N (%)
<char> <char>
1: loneliness 2.07 (0.58)
2: sex
3: male 12 (13.6%)
4: female 76 (86.4%)
Yielding this nice descriptive statistics table:
knitr::kable(egltable(c("loneliness", "sex"), data = dm[!duplicated(ID)]))
            M (SD)/N (%)
loneliness  2.07 (0.58)
sex
  male      12 (13.6%)
  female    76 (86.4%)
When working with multilevel data and a mix of between variables (especially sociodemographics that are often asked once) and repeated measures variables, watch out for which are which type and select only one row per ID when calculating descriptives or graphs for between person variables. The same applies when making exploratory plots or figures.
plot(testDistribution(dm$loneliness), varlab ="this is wrong")
This is wrong, has many duplicates for each loneliness score
plot(testDistribution(dm[!duplicated(ID)]$loneliness), varlab ="this is right")
This is right, only has one loneliness score per person.
For repeated measures variables or any variable that can vary within a unit, we have several options for how to calculate descriptive statistics. The choices depend on whether the variable is continuous or categorical/discrete.
1.5.1 Continuous Variables
With multilevel data in long format, if we calculate the mean and variance of a variable, that would average across units and observations within units for the mean. The variance will incorporate both differences between units and variance within units (how much the data points vary around each unit’s own mean). That is, it is essentially the total variance.
Conversely, if we first average observations within a unit (e.g., person), then the mean will be the average of the individual averages, and the variance will only be the variability between individual units’ means. That is, we could first create a between person variable by averaging scores for each unit and then calculate descriptives as usual for that variable.
dm[, c("Bstress", "Wstress") := meanDeviations(dStress), by = ID]
egltable(c("Bstress"), data = dm[!duplicated(ID)])
M (SD)
<char> <char>
1: Bstress 4.08 (1.11)
We can interpret the descriptive statistics for Bstress as, “The mean and standard deviation of the average level of stress across days was 4.08 (SD = 1.11).”
For continuous variables, we also could calculate descriptives on all data points directly rather than on the average by person. Using all data points can unequally weight different participants (e.g., imagine one participant contributed 100 data points and 100 other participants each contributed 1 data point; the average would then be 50% from the 100 participants with 1 data point each and 50% from the 1 participant with 100 data points). The issue of weighting tends to be less important to the extent that clusters are all about the same size, and makes no difference if all clusters are identical (e.g., everyone has exactly 10 observations). If there are no systematic differences such that, for example, participants with the highest or lowest levels of stress tend to have more/fewer data points, then the mean is likely to be quite similar regardless of the approach.
The standard deviation from all data points is more like the total variance, combining variance from between and within participants. Thus, we expect this standard deviation to be at least as large as, and most likely larger than, the standard deviation of the individual means, since those individual means “smooth” out or remove variation within person.
egltable(c("dStress"), data = dm)
M (SD)
<char> <char>
1: dStress 4.13 (1.41)
In this example we can see that while the mean changes by .05, the standard deviation changes by quite a bit more, reflecting the added within person variance.
A final approach that can be taken for continuous variables is to only pick a single observation from each unit to use in calculations. This tends to work best with longitudinal data where, for example, you could pick either the first day or the last day and calculate descriptives just for that one day as an example. To do this, we use the overall variable, but we subset the dataset to only include one day.
egltable(c("dStress"), data = dm[SurveyDay ==1])
M (SD)
<char> <char>
1: dStress 4.03 (1.29)
This we can interpret as any other “usual” descriptive statistics. It is the average level of stress and standard deviation across participants, on the first day. If there are few time points (e.g., in a longitudinal study with just baseline, post, and follow-up), it might make sense to simply report the means and standard deviations of continuous variables at each time point. With a long dataset, this can easily be done by specifying the time point as the grouping variable. Note: egltable() does not calculate correct tests, effect sizes, or p-values when the grouping variable is a repeated measure rather than independent groups, so if you do it this way, ignore the tests and p-values, which assume independence.
egltable(c("dStress"), g ="SurveyDay", data = dm)
1 M (SD) 2 M (SD) 3 M (SD) 4 M (SD) 5 M (SD)
<char> <char> <char> <char> <char> <char>
1: dStress 4.03 (1.29) 4.30 (1.40) 4.35 (1.47) 4.11 (1.43) 3.81 (1.54)
Test
<char>
1: F(4, 278) = 1.13, p = .345, Eta-squared = 0.02
If the data are already in wide format, one would simply calculate descriptive statistics on each of the separate variables representing each time point.
1.5.1.1 Summary
With multilevel, continuous variables, three approaches we discussed for calculating descriptive statistics are:
average by person and report descriptives on the individual averages;
report descriptives on the overall variable which captures the total variance but possibly unequally weights participants;
report descriptives on individual timepoints/assessments.
1.5.2 Categorical Variables
Compared to continuous variables, there are fewer options for presenting descriptive statistics with multilevel categorical data. For example, suppose people reported each day whether they: walked, rode a bike, or went to the gym for exercise. It does not make sense to average these data per person, although it could be possible to average, for example, the proportion of days each participant engaged in any one of those activities. Categorical / discrete data are typically presented as frequencies and percentages, and averages of frequencies, especially with skewed data, are not all that easy to interpret. For example, if each person either rides a bike every day or walks every day, averages would misleadingly appear to show that on average people walk half the days and ride a bike half the days.
The general choices, then, for multilevel categorical data descriptives are either to report descriptives for a variable overall or to report descriptives for a specific time point (e.g., day 1, 2, etc.) of longitudinal data.
## overall
egltable("Int_Fri", data = dm, strict = FALSE)

## day 1 only
egltable("Int_Fri", data = dm[SurveyDay == 1], strict = FALSE)
These results show the total frequency and percentage of interactions with a friend (1 = interacted with at least one friend that day) in overall (first table) and the frequency and percentage of people that reported interacting with a friend on day 1 (second table).
1.5.3 Putting it All Together
Calculating and reporting descriptive statistics from multilevel data stored in long format can require putting together everything we have talked about. In this example, we want a descriptive statistics table for the following variables:
sex (measured baseline only; categorical)
loneliness (measured baseline; continuous)
selfesteem (measured baseline; continuous)
dStress (measured daily; continuous, wanted average levels)
Int_Fri (measured daily; categorical)
First for any continuous variables that we want to report on average levels only, dStress, we create a between person version. Then we subset the dataset to one row per person and calculate descriptives and store these in desc1. Then we calculate any descriptives we want on the daily data, here for Int_Fri. Then we name the columns the same using setnames() and finally we combine the two tables by binding them rowise, using rbind() and make a nice table with kable() from the knitr package.
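A sketch of this workflow follows. This is a reconstruction of the steps described above, not necessarily the exact code used; it assumes the dm dataset and the variable names listed (dStress, Int_Fri, selfesteem, etc.).

```r
## between person version of daily stress (average per person)
dm[, Bstress := mean(dStress, na.rm = TRUE), by = ID]

## descriptives for between person variables: one row per ID
desc1 <- egltable(c("sex", "loneliness", "selfesteem", "Bstress"),
                  data = dm[!duplicated(ID)], strict = FALSE)

## descriptives for daily variables: all rows
desc2 <- egltable("Int_Fri", data = dm, strict = FALSE)

## give both tables the same column names, stack them, and format
setnames(desc2, names(desc2), names(desc1))
knitr::kable(rbind(desc1, desc2))
```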
From these results we can see that 86.4% of the participants were female, the average loneliness was 2.07 (possible scores 1-4), the average self-esteem was 3.43 (possible scores 1-5), the average of the mean stress across days was 4.08 (possible scores 1-7), and 64% of the days completed featured an interaction with a friend.
1.6 Linear Mixed Models
Note
Linear mixed models (LMMs) or mixed effects models are also sometimes called “multilevel models” or “hierarchical linear models” (HLMs). The multilevel model or hierarchical linear model names both come from the fact that they are used for data with multiple, hierarchical levels, like observations nested within people, or kids nested within classrooms. However, all three of these are synonyms for the same statistical model, that includes both fixed and random effects, as random effects are the method used to address the hierarchical or multilevel nature of the data.
One difference, however, is that HLMs imply a specific hierarchy, which may not always be present in data. For example, if all siblings from a family were recruited as well as students in the same classroom, there would be crossed effects, with any given person a member of a higher level family unit and a higher level classroom unit. HLMs generally are not set up for situations where there is not a pure hierarchy. Mixed effects models address this easily as it just requires different random effects (one for families, one for classrooms, for example). Thus, all HLMs are mixed models, but not all mixed models are HLMs. Another example would be if, in an experiment, we randomly sampled difficulty levels and every participant completed all combinations. Difficulty level could be a random effect and participant could be a random effect, with observations nested within both, but they do not form a clear hierarchy (i.e., difficulty level is not above/below participant).
Unlike linear regression, which models all regression coefficients as fixed effects, mixed effects regression models some coefficients as fixed effects and others as random effects (hence the term, “mixed” effects).
A random effect is a regression coefficient that is allowed to vary as a random variable. Random effects provide a way of dealing with non-independence in the data by explicitly modeling the systematic differences between different units.
The simplest linear mixed model (LMM) is an intercept only model, where the intercept is allowed to be a random variable.
\[ y_{ij} = b_{0j} * 1 + \varepsilon_{ij} \]
Here we have an observed outcome, \(y\), with observations for a specific person (unit), \(j\), at a specific timepoint / lower level unit, \(i\). In addition, note that our intercept, which in linear regression is \(b_0\), is now \(b_{0j}\), indicating that the intercept is an estimated intercept for each unit, \(j\). In psychology our units often are people, but as noted before the “unit” may be something else, such as students within classrooms, or kids within families, etc.
By convention, we decompose random effects, here the random intercept, as follows:
\[ b_{0j} = \gamma_{00} + u_{0j} \]
where
\(\gamma_{00}\) is the mean intercept across all people. There are no variable subscripts under it, only numbers, as it does not vary across different people.
\(u_{0j}\) is the individual unit deviations from the mean intercept, \(\gamma_{00}\).
Under this decomposition, \(\gamma_{00}\) is a fixed effect, the average intercept for all units, and is interpreted akin to what we are used to in linear regression, \(b_0\). \(u_{0j}\) is how much each individual unit’s intercept differs from the average; they are deviations.
We also make another, new distributional assumption, we assume that:
\[ u_{0j} \sim \mathcal{N}(0, \sigma_{int}) \]
In other words, we assume that individual units’ deviations from the fixed effect, average intercept follow a normal distribution with mean 0 and standard deviation equal to the standard deviation of the deviations.
The actual parameters of the model that must be estimated are:
\(\gamma_{00}\), the fixed effect intercept
\(\sigma_{int}\), the standard deviation of the individual deviations from the fixed effect intercept, the \(u_{0j}\)
\(\sigma_{\varepsilon}\), the residual error term
A powerful feature of LMMs is that despite allowing / modelling individual differences in a coefficient, here the intercept, we do not have to actually estimate a separate intercept parameter for each unit / person. In fact, it takes only one more parameter than a regular linear regression. The only additional parameter is the standard deviation of the individual intercepts. We also have one new distributional assumption, by assuming that the random intercept also follows a normal distribution, but in exchange for the one extra parameter and one new assumption, we are able to relax the assumption of linear regression that each observation is independent.
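To make the three parameters concrete, here is a small simulation from this intercept-only model (the parameter values are made up):

```r
set.seed(1)
n_id  <- 50  ## number of people (units)
n_obs <- 7   ## observations per person

gamma00   <- 4    ## fixed effect intercept (made-up value)
sigma_int <- 1    ## SD of the random intercept deviations (made-up value)
sigma_eps <- 1.2  ## residual SD (made-up value)

u0j <- rnorm(n_id, mean = 0, sd = sigma_int)  ## unit deviations
b0j <- gamma00 + u0j                          ## person-specific intercepts

## each person's intercept plus observation-level residual error
y <- rep(b0j, each = n_obs) + rnorm(n_id * n_obs, mean = 0, sd = sigma_eps)
```

Only gamma00, sigma_int, and sigma_eps generate the data; the 50 person-specific intercepts fall out of the normal distribution assumption rather than being separate parameters.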
To fit an LMM in R, we use the lmer() function from the lme4 package. lmer() stands for linear mixed effects regression, and it follows a syntax very similar to lm(). The main difference/addition is a new section in parentheses indicating which specific regression coefficients should be random effects and which ID variable identifies the units for the random effects. Here we will fit an intercept only LMM on stress from the daily data collection exercise data. We add a random intercept by ID using the syntax: (1 | ID).
Note
Summary of output:
Type of model, how it was fit (Restricted Maximum Likelihood, REML), and how statistical inference was performed (t-tests using Satterthwaite’s approximation method to calculate approximate degrees of freedom)
Formula used to define the model is echoed for reference
REML criterion, which is akin to a log likelihood value. We’ll talk about this more in a later lecture.
Scaled residuals, which are rescaled from the raw residuals. Interpret these similarly to usual residuals from linear regression.
A new section: Random Effects. This shows the variance and standard deviation (just square root of variance) for the random intercept and the residual variance (our residual error term from linear regression).
Total number of observations included and the number of unique units, here IDs, representing different people. These numbers are what were included in the analysis.
Fixed Effects regression coefficient table for LMMs is interpreted comparably to the regression coefficients table for a linear regression. One note is that the degrees of freedom, df, cannot readily be calculated in LMMs. Satterthwaite’s approximation method was used here; because this yields estimated degrees of freedom, you can get decimals, not just whole numbers. The decimals are not an error.
The summary shows information about how the model was estimated and its overall results.
m <- lmer(dStress ~ 1 + (1 | ID), data = dd)
summary(m)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: dStress ~ 1 + (1 | ID)
Data: dd
REML criterion at convergence: 947
Scaled residuals:
Min 1Q Median 3Q Max
-3.1329 -0.6595 -0.0215 0.6846 2.2419
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 0.784 0.886
Residual 1.182 1.087
Number of obs: 283, groups: ID, 88
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 4.101 0.119 85.379 34.5 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can interpret that in this model, there were 283 observations included in the linear mixed model, and these observations came from 88 different people. On average, these people had a stress score of 4.1, the fixed effect intercept, and this was significantly different from zero. By examining the random effects standard deviations, Std.Dev., we can see that the random intercept standard deviation is pretty close to the residual standard deviation, which is consistent with the earlier ICC we calculated for stress: roughly 40% of the variance in stress is between people, with the remainder occurring within people.
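As a quick check, the ICC implied by this model can be computed directly from the two variance components copied from the Random effects table above:

```r
## ICC from the variance components in the output above
var_int <- 0.784  # random intercept variance (ID)
var_res <- 1.182  # residual variance
icc <- var_int / (var_int + var_res)
icc  # about 0.40: roughly 40% of the variance in stress is between people
```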
Note
In mixed models we do not technically estimate parameters for each individual. That is, we do not directly estimate random effects like \(b_{0j}\). We estimate the mean of the distribution, \(\gamma_{00}\), and the standard deviation of the distribution, \(\sigma_{int}\).
The coef() function then does not report estimated coefficients per se, but what are called Best Linear Unbiased Predictors (BLUPs).
The BLUPs use the model estimated parameters from the LMM along with the data to predict what each person’s own, conditional mean is. This combines both their actual data and the model. People with more data will have a BLUP closer to the observed mean of their own data in an intercept only model. People with less data (say only 1-2 observations) will have a BLUP that is closer to the average mean of all people, as it is assumed that the mean of their 1-2 data points is likely very noisy/inaccurate due to the small sample size.
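For a random intercept only model, this weighting can be sketched with a simple formula: the BLUP is approximately a weighted average of a person’s own mean and the fixed effect intercept, with weight \(n_j \sigma^2_{int} / (n_j \sigma^2_{int} + \sigma^2_{\varepsilon})\). The sketch below uses the variances and fixed intercept from the model output above; the person mean of 2 is made up for illustration.

```r
## Sketch: shrinkage weight for a random intercept BLUP
tau2    <- 0.784   # random intercept variance (from the output above)
sig2    <- 1.182   # residual variance (from the output above)
gamma00 <- 4.101   # fixed effect intercept (from the output above)
ybar    <- 2       # hypothetical person mean
n       <- c(1, 5, 30)  # varying numbers of observations per person

w <- (n * tau2) / (n * tau2 + sig2)  # weight grows towards 1 as n grows
blup <- w * ybar + (1 - w) * gamma00
round(blup, 2)  # with more data, the BLUP moves toward the person's own mean
```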
*Side note: while many of the concepts from LMMs apply to GLMMs just as many concepts from LMs apply to GLMs, the BLUPs or conditional means do not quite generalize. In GLMMs we can get conditional modes, but not conditional means. This only matters if you would be doing, for example, logistic mixed effects regression models.
We can see the fixed effects coefficients only from the model by using the fixef() function. If we want to see the predictions of the coefficients for each person, i.e., using the random effects, we can use the coef() function.
The random intercept coefficient estimates are similar conceptually to what we would get if we calculated the average stress score, per person, across days. However, there are some differences. To see these, we’ll put both into the same dataset.
## make a data table of the random intercepts and IDs
randomintercept <- data.table(
  ID = as.numeric(rownames(coef(m)$ID)),
  RI = coef(m)$ID[, "(Intercept)"])

## view the first few rows
head(randomintercept)
## calculate the means and number of observations, by ID
individualMeans <- dd[!is.na(dStress), .(
  Means = mean(dStress),
  N = .N), by = ID]

## merge the two datasets together by ID
rimeans <- merge(randomintercept, individualMeans, by = "ID", all = FALSE)

## order the dataset by mean
rimeans <- rimeans[order(Means)]
rimeans[, ID := factor(ID, levels = ID)]

## view the merged data
head(rimeans)
ID RI Means N
<fctr> <num> <haven_labelled> <int>
1: 143 2.3 1.0 2
2: 145 1.7 1.0 5
3: 150 2.0 1.2 4
4: 109 2.4 1.8 4
5: 26 3.3 2.0 1
6: 49 2.5 2.0 5
Now we can make a figure with an arrow for each person that starts at the raw mean and ends at the random intercept estimate for that individual. The fixed effect intercept is added as a line. The individual arrows are coloured based on how many days of stress data each person had available.
ggplot(rimeans, aes(ID, xend = ID, y = Means, yend = RI, colour = N)) +
  geom_hline(yintercept = fixef(m)[["(Intercept)"]]) +
  geom_segment(arrow = arrow(length = unit(.15, "cm"))) +
  coord_flip() +
  theme_pubr() +
  ggtitle("Shrinkage From Raw to LMM Means")
Figure showing how, compared to raw means, random intercept estimates from LMMs tend to shrink individual estimates towards the overall fixed effect estimate, the solid line in the middle. The degree of shrinkage tends to be greater for individuals that are further away from the fixed effect estimate and is greater for individuals with fewer data points.
This figure highlights how shrinkage distinguishes LMMs from simply fitting models (here a mean) separately to each individual person. LMMs tend to shrink estimates towards the overall fixed effect estimate. In other words, more extreme estimates get pulled in closer to the mean, which has the effect of stabilising relatively extreme values. How much an estimate is shrunken depends on both how extreme it is (the more extreme, the more shrinkage) and on how many observations are available for that unit. If a unit has many data points, it will have less shrinkage.
For some more discussion and examples of this, see:
Finally, we can run model diagnostics on LMMs similarly to how we did for linear regressions. If we run the modelDiagnostics() function on our model, we get an error about an unknown class of type haven_labelled.
md <- modelDiagnostics(m, ev.perc = .001)
This is telling us that modelDiagnostics() is not sure how to work with data of that type in R. Because we know stress is basically numeric data, we can convert stress into a numeric variable in R using as.numeric(). This is basically just a way of telling R that it really is just numeric data. Then we refit our model and again ask for diagnostics. Note that if we had lots of variables in our model, we might have to check the classes of multiple variables to find the problem. Since we only had stress in the model, we know it is the culprit.
## check the R class of stress
class(dd$dStress)
[1] "haven_labelled" "vctrs_vctr" "double"
## convert stress in R to a numeric class
dd[, dStress := as.numeric(dStress)]
Now, we can refit our model and calculate diagnostics. The plot of the residuals and the residuals versus predicted values plots are familiar and are used to assess the same two assumptions for LMMs as for linear regression models. We also get a density plot for ID : (Intercept), the random effect for intercepts. This is to assess the assumption that the random effects, the random intercept coefficients, are approximately normally distributed. In this case, all the assumptions appear reasonably well met, with only one relatively extreme stress residual noted at -3.13.
In later lectures, we will dig further into estimating LMMs and adding predictors with both fixed and random effects.
2 Data Cleaning for LMMs
A common data cleaning task is to identify and address extreme values. In multilevel / repeated measures data, extreme values may exist at the between person level (i.e., a person who is relatively extreme compared to other people) or at the within person level (i.e., an assessment that is relatively extreme for that person).
In order to assess whether an individual is, on average, similar to or relatively extreme compared to other individuals, or whether a specific data point within an individual is similar to or different from that individual’s own typical values, we often calculate the mean per person and the deviations from the mean. This essentially decomposes a repeated measures variable into a part that is between person and a part that is within person. One way to make it clear which is which is to add a prefix, like B for between and W for within.
To clean stress, we will start by calculating the average for each person and the deviations from each person’s own average, using the meanDeviations() function. We will create two new variables in the dataset: Bstress and Wstress and these will be created by ID. If we look at the first few rows of the data, we can see that if we were to add up the between and within stress variables, we would get back the original stress variable.
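The same decomposition can be sketched in base R on toy data (meanDeviations() from JWileymisc does this per ID for us; the tiny dataset below is made up for illustration):

```r
## toy data standing in for the daily diary stress variable
d <- data.frame(ID = c(1, 1, 1, 2, 2), dStress = c(3, 4, 5, 2, 4))
d$Bstress <- ave(d$dStress, d$ID)   # between: each person's own mean
d$Wstress <- d$dStress - d$Bstress  # within: deviations from that mean
## between + within recovers the original variable
all.equal(d$Bstress + d$Wstress, d$dStress)
```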
We are going to start by examining the data at the within person level. The reason for this is that if there is a particular day that is very extreme for a particular person, that extreme value on a day not only will be extreme at the within person level, but because the average stress for a person is calculated by averaging all of their days of data, that extreme value will impact their mean at the between level. So we want to first clean the within person level of data.
Examining the data at the within level helps to identify whether, for a particular person, any observation is extreme/an outlier compared to what is normal for that person. We do this by examining the within stress variable, which is how different each day’s stress data are for each person from their own mean. For this, we use the original daily dataset because we need the repeated measures included.
Note, we do not always use the top and bottom 0.5% of a theoretical distribution. This is used here as it helps to identify some extreme values and it is relatively conservative. However, for excluding cases, it is also common to use a more extreme threshold, like the top and bottom 0.1%, which would be obtained as ev.perc = .001.
In the figure, based on the specified definition of extreme values, the top and bottom 0.5% of a theoretical normal distribution, we can see there are four extremely low stress values and one extremely high stress value.
We can either figure out approximately where those scores are and manually subset the dataset to find out which IDs and which days, or use the testDistribution() function again. Doing it manually requires a bit of guesswork to pick the correct cut off to find the cases. Using testDistribution(), we select the data and pick only those rows that are extreme values using isEV == "Yes" and then look at the OriginalOrder column to see the row numbers in the original dataset, here dd. Then we can use those row numbers to directly look at the correct rows of the dataset.
## create a dd dataset without ID = NA
dd2 <- dd[!is.na(ID)]

## manually identify ID and day from dd with missing IDs excluded
dd2[Wstress < -2.3, .(dStress, Bstress, Wstress, SurveyDay, ID)]
## use testDistribution to find the row numbers of extreme values
testDistribution(dd2$Wstress,
  extremevalues = "theoretical", ev.perc = .005)$Data[isEV == "Yes"]
This effort reveals that ID 29 on day 5, ID 23 on day 5, ID 110 on days 1 and 5, and ID 64 on day 1 are relatively extreme values. We could exclude them based on the ID and day, or if we used testDistribution() we know exactly which rows to exclude from the dataset. We remove rather than select rows by adding a minus sign in front. We save the new dataset as dd.noev to indicate no extreme values. We could run diagnostics again or call it a day. If we do run another plot, we see one new observation emerge as relatively extreme. It is not too bad and we’ve already done one cleaning pass, so I would not remove this newly flagged observation.
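The removal step can be sketched like this on toy data. In the real analysis, the row numbers would come from the OriginalOrder column of the testDistribution() output; here evRows and the tiny dataset are made-up stand-ins:

```r
## toy stand-in for dd2 with a couple of extreme within-person values
dd2toy <- data.frame(ID = 1:6, Wstress = c(0.1, -2.5, 0.3, 2.6, -0.2, 0))
evRows <- which(abs(dd2toy$Wstress) > 2.3)  # stand-in for OriginalOrder values
dd.noev <- dd2toy[-evRows, ]                # minus sign removes those rows
nrow(dd.noev)  # the flagged rows are dropped
```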
Now that we have cleaned the within person level, we need to recreate the between person level. This is because with some specific, extreme days excluded the people impacted, IDs 29, 23, 110, and 64, will have a different average stress value. We use exactly the same code as earlier to make a between and within stress variable, but do it on the dd.noev dataset.
dd.noev[, c("Bstress", "Wstress") := meanDeviations(dStress), by = ID]
Next, we examine the mean stress values to clean the data at the between person level. In this long dataset, the between person values (mean stress) are repeated for every day the person collected data. To clean the data at the between level, we only want one row of data per person. It does not really matter which row we get, since the between stress variable will be the same for all rows of the same ID, so we can just take rows where ID is not duplicated and store that in a between person daily dataset, dd.b.
dd.b <- dd.noev[!duplicated(ID)]
From here, we can evaluate the distribution and look for any outliers or extreme values just as we did in the Data Visualization 1 topic using testDistribution() and plot().
In this example, average stress looks pretty good. It is approximately normally distributed based on the density plot. There are a couple of fairly extreme low values, near 1, but they are not that far away from the rest of the data points. As usual, we can get a five number summary from the x axis, showing that the median of the average daily stress scores is 4.2, with an interquartile range from 3.6 to 5.0. The interpretation is as usual, other than we have to keep in mind that what we are plotting is not individual stress scores from specific days but average stress scores across days per person. As long as we keep that in mind, we can interpret this graph the same as any other time we would evaluate a variable for outliers, etc.
If there were extreme values, we could follow a similar approach as for the within person data. However, rather than excluding specific days, we would need to exclude entire IDs from the dataset and all rows associated with that ID.
Another approach that can be taken to extreme values is to winsorize them, for example to the 1st and 99th percentiles, which can help to reduce the influence of extreme scores without fully excluding those data points.
However, winsorizing with multilevel data is complicated by the fact that any change to the mean also impacts the within level data and that any change to the within level data impacts the mean.
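A minimal sketch of winsorizing to the 1st and 99th percentiles is below. The winsorize() helper is written from scratch for illustration, not from a package, and the toy data are made up:

```r
## Sketch: a from-scratch winsorizing helper (illustrative only)
winsorize <- function(x, probs = c(.01, .99)) {
  q <- quantile(x, probs, na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])  # clamp values beyond the chosen percentiles
}

set.seed(1)
x  <- c(rnorm(98), -10, 10)  # toy data with two extreme values
xw <- winsorize(x)
range(xw)  # the extremes are pulled in toward the 1st/99th percentiles
```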
3 Optional Only: Extra Resources
If you want some introduction to mixed effects models for GLMs rather than just linear models, see: